(54) Extending cluster membership and quorum determinations to intelligent storage systems

(57) One embodiment of the present invention pro- 
vides a system that establishes an agreement between 
members of a cluster of nodes in a clustered computing 
system for purposes of forming a new cluster if there is 
a failure to communicate with a member of the cluster. 
The system operates by detecting a failure to commu- 
nicate with the member of the cluster. In response to 
detecting the failure, the system attempts to establish 
an agreement between a group of nodes in the cluster 
that constitute a quorum of the members of the cluster. 
This process of establishing an agreement involves in- 
itiating communications from the intelligent storage de- 
vice controller with other nodes in the cluster. The intel- 
ligent storage device controller additionally controls at 
least one storage device. If such agreement is estab- 
lished, the system forms the new cluster from the mem- 
bers of the cluster that have reached agreement.

Description
BACKGROUND 
Field of the Invention 

[0001] The present invention relates to intelligent sec-
ondary storage systems for computers. More specifical- 
ly, the present invention relates to a method and an ap- 
paratus for establishing a quorum between nodes in a 
clustered computing system that uses an intelligent stor- 
age device controller as a cluster member in establish- 
ing the quorum. 

Related Art 

[0002] In clustered computer systems, computing 
nodes are closely-coupled into a cluster of related com- 
puting nodes that can share resources and work togeth- 
er in completing computational tasks. One advantage of 
a clustered computing system is that a cluster, if de-
signed properly, can continue to operate even when
nodes in the cluster fail, or when communication links 
between nodes in the cluster fail. 
[0003] A problem can arise if a clustered computing 
system somehow becomes partitioned into two sepa- 
rate groups of nodes that are unable to communicate 
with each other. For example, referring to FIG. 2A, sup-
pose a communication link failure creates a partition
200, which divides nodes 102-105 into two groups of
nodes 102-103 and 104-105 that are unable to commu-
nicate with each other. (The nodes typically detect such 
a failure by periodically exchanging "heartbeat" informa- 
tion with each other. If the heartbeat information is not 
properly received, some type of communication link or 
node failure has occurred.) Partitioning of a clustered 
computing system can create problems if both groups 
of nodes continue to perform the same tasks, because 
both groups can potentially attempt to operate on a 
shared resource, such as a shared file, at the same time. 
[0004] This problem is commonly referred to as the 
"split brain" problem. One way to prevent the split brain 
problem is to use a quorum mechanism that allows a 
group of nodes to access shared resources of the clus- 
tered computing system only if the group contains a ma- 
jority of the nodes of the clustered computing system. 
In this way, at most one group of nodes at a time can be 
granted access to the shared resources of the clustered 
computing system. This prevents the "split brain" prob- 
lem. 

[0005] However, even with a quorum mechanism, a 
clustered computing system can still have problems in 
certain situations. For example, in FIG. 2A, a four-node
cluster is divided into two groups of nodes, each of which
has two nodes. In this case, neither of the groups of
nodes, 102-103 nor 104-105, has a majority. Hence, nei-
ther of the groups, 102-103 or 104-105, can access the
shared resources of the clustered computing system.
This problem can be somewhat alleviated by giving dif-
ferent numbers of "votes" to different members of a clus-
ter. For example, in FIG. 2A, if node 102 is given two
votes for purposes of determining a quorum, nodes
102-103 will have three of the five votes in the cluster
and will hence have a majority of the votes in the cluster.
This allows nodes 102-103 to establish a quorum.
[0006] However, even assigning different numbers of 
votes to different nodes does not work in certain situa-
tions. For example, suppose in FIG. 2A that node 102
is given two votes, and suppose that nodes 102-103 fail.
The remaining nodes 104-105 will not be able to estab-
lish a quorum, in spite of the fact that no other nodes
are active.
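
By way of illustration only, the vote arithmetic described above can be sketched as follows. The function name `has_quorum` and the vote tables are assumptions introduced for this example and do not appear in the disclosure; a group is taken to hold a quorum only when it holds strictly more than half of all votes in the cluster.

```python
def has_quorum(group, votes):
    """Return True if `group` holds strictly more than half of all votes.

    `votes` maps each cluster member to its vote count (hypothetical layout).
    """
    total = sum(votes.values())
    held = sum(votes[member] for member in group)
    return 2 * held > total


# FIG. 2A with one vote per node: a 2/2 split leaves neither side a majority.
equal_votes = {102: 1, 103: 1, 104: 1, 105: 1}
print(has_quorum({102, 103}, equal_votes))     # False
print(has_quorum({104, 105}, equal_votes))     # False

# Giving node 102 two votes lets nodes 102-103 hold three of the five votes.
weighted_votes = {102: 2, 103: 1, 104: 1, 105: 1}
print(has_quorum({102, 103}, weighted_votes))  # True

# If nodes 102-103 fail instead, the survivors hold only two of five votes.
print(has_quorum({104, 105}, weighted_votes))  # False
```

The last case reproduces the shortcoming noted above: the surviving nodes cannot establish a quorum even though no other nodes are active.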

[0007] Existing solutions to the problem of detecting
failures in a clustered computing system and forming a
new cluster have so far overlooked the possibility of us-
ing other resources within the clustered computer sys-
tem. For example, intelligent storage device controllers
now have enough processing power and memory to
help in the process of forming a cluster.
[0008] What is needed is a method and an apparatus 
that makes use of other resources within a clustered 
computing system, such as intelligent storage device
controllers, to aid in the process of detecting failures in
the clustered computing system and reconfiguring the 
clustered computing system. 

SUMMARY 

[0009] One embodiment of the present invention pro-
vides a system that establishes an agreement between
members of a cluster of nodes in a clustered computing
system for purposes of forming a new cluster if there is
a failure to communicate with a member of the cluster.
The system operates by detecting a failure to commu-
nicate with the member of the cluster. In response to
detecting the failure, the system attempts to establish
an agreement between a group of nodes in the cluster
that constitute a quorum of the members of the cluster.
This process of establishing an agreement involves in-
itiating communications from the intelligent storage de-
vice controller with other nodes in the cluster. The intel-
ligent storage device controller additionally controls at
least one storage device. If such agreement is estab-
lished, the system forms the new cluster from the mem-
bers of the cluster that have reached agreement.
[0010] In one embodiment of the present invention, 
the quorum contains a subset of the members of the
cluster with more than one half of a number of votes that
are distributed between the members of the cluster. In
a variation on this embodiment, different members of the
cluster can have different numbers of votes for purposes
of establishing the quorum. In another variation on this
embodiment, the intelligent storage device controller
has one vote for each storage device under its control
for purposes of establishing the quorum.
[0011] In one embodiment of the present invention,
the system detects the failure by periodically communi-
cating with other members of the cluster in order to de- 
termine whether communications with the other mem- 
bers of the cluster have failed. In a variation on this em- 
bodiment, communicating with the other members of the 
cluster involves communicating information that ena- 
bles members of the cluster to verify that they have a 
consistent view of a state of the cluster. 
[0012] In one embodiment of the present invention, 
the system detects the failure by periodically receiving 
information at the intelligent storage device controller 
from other members of the cluster. This information en- 
ables the intelligent storage device controller to deter- 
mine whether communications with the other members 
of the cluster have failed. 

[0013] In one embodiment of the present invention, in 
forming the new cluster the system additionally ex- 
cludes nodes in the clustered computing system that are 
not members of the new cluster from accessing shared 
resources of the clustered computing system. 
[0014] In one embodiment of the present invention, 
the intelligent storage device controller can additionally 
process both file-level and block-level accesses to a 
storage device. 

[0015] In one embodiment of the present invention, 
cluster membership gives a node access to shared re- 
sources of the clustered computing system. 

BRIEF DESCRIPTION OF THE FIGURES 

[0016] FIG. 1 illustrates a clustered computing system
in accordance with an embodiment of the present inven-
tion.

[0017] FIG. 2A illustrates an example of a partition be-
tween nodes in a clustered computing system.
[0018] FIG. 2B illustrates an example of a partition be-
tween nodes in another clustered computing system in
accordance with an embodiment of the present inven-
tion.

[0019] FIG. 3 illustrates system layers involved in per-
forming a file system access wherein block-level com-
mands are transferred to a storage system in accord-
ance with an embodiment of the present invention.
[0020] FIG. 4 illustrates system layers involved in per-
forming a file system access wherein file-level com-
mands are transferred to a storage system in accord-
ance with an embodiment of the present invention.
[0021] FIG. 5 is a flow chart illustrating the process of
detecting a communication failure in accordance with an
embodiment of the present invention.

DETAILED DESCRIPTION 

[0022] The following description is presented to ena- 
ble any person skilled in the art to make and use the 
invention, and is provided in the context of a particular 
application and its requirements. Various modifications 
to the disclosed embodiments will be readily apparent
to those skilled in the art, and the general principles de-
fined herein may be applied to other embodiments and
applications without departing from the spirit and scope
of the present invention. Thus, the present invention is
not intended to be limited to the embodiments shown,
but is to be accorded the widest scope consistent with 
the principles and features disclosed herein. 
[0023] The data structures and code described in this 
detailed description are typically stored on a computer 
readable storage medium, which may be any device or
medium that can store code and/or data for use by a
computer system. This includes, but is not limited to,
magnetic and optical storage devices such as disk
drives, magnetic tape, CDs (compact discs) and DVDs
(digital video discs), and computer instruction signals
embodied in a transmission medium (with or without a 
carrier wave upon which the signals are modulated). For 
example, the transmission medium may include a com- 
munications network, such as the Internet. 

Clustered Computing System 

[0024] FIG. 1 illustrates clustered computing system 
100 in accordance with an embodiment of the present
invention. Clustered computing system 100 includes
nodes 102-105, which are coupled to intelligent storage
system 120 through network 106. Nodes 102-105 are
clustered together so that they can share resources,
such as intelligent storage system 120. Nodes 102-105
can include any type of computers, including, but not
limited to, computers based upon microprocessors,
mainframe processors, device controllers, and compu-
tational engines within appliances. Note that nodes
102-105 may additionally include semiconductor mem-
ory as well as other computer system components.
[0025] In the illustrated embodiment, nodes 102-105 
are coupled together through network 106. Network 106
can include any type of wired or wireless communication
channel capable of coupling together nodes 102-105
and intelligent storage system 120. This includes, but is
not limited to, a local area network, a wide area network,
or a combination of networks. In the illustrated embod-
iment, network 106 includes switches 140 and 141,
each of which has separate connections to nodes
102-105 and intelligent storage system 120. This pro-
vides fault-tolerance, because if either of switches 140
or 141 fails, the other switch can continue to provide net-
work connectivity.

[0026] In one embodiment of the present invention, 
nodes 102-105 provide fault-tolerance by allowing
failovers between nodes 102-105. For example, sup-
pose node 102 is acting as a primary server and node
103 is acting as a backup secondary server. If node 102
fails, node 103 can take over the server functions per-
formed by node 102.

[0027] Intelligent storage system 120 includes con-
troller 122 and cache 124, which are coupled to storage
devices 130-133. Intelligent storage system 120 addi-
tionally includes controller 123 and cache 125, which
are similarly coupled to storage devices 130-133. Fur-
thermore, each of controllers 122 and 123 has sepa-
rate linkages to switches 140 and 141 within network
106.
[0028] Note that providing two controllers 122-123 
and two caches 124-125 allows intelligent storage sys-
tem 120 to provide fault-tolerance in cases where one
of the two controllers 122-123 or one of the two caches
124-125 fails. Also note that the present invention can 
additionally apply to non-fault-tolerant storage systems 
that include only a single controller and a single cache. 
[0029] Controllers 122-123 can include any type of 
computational devices that can be configured to act as 
controllers for storage devices. During operation, con- 
trollers 122-123 manage a number of components, in- 
cluding storage devices 130-133 and caches 124-125. 
Controllers 122-123 can also manage mirroring of cach-
es 124-125. This keeps caches 124 and 125 consistent 
with each other for fault-tolerance purposes. 
[0030] Note that controllers 122-123 typically operate
under control of a file server located on one of nodes 
102-105. Controllers 122-123 can process block-level 
access requests from the file server in the form of a stor- 
age device identifier and a block number. 
[0031] Caches 124 and 125 can include any type of
random access memory for caching data from nodes
102-105. In one embodiment of the present invention,
caches 124-125 include non-volatile random access
memory based upon flash memory or battery-backed
memory. This allows a transaction to be permanently
committed into non-volatile storage in caches 124-125
without having to wait for storage devices 130-133.
[0032] Storage devices 130-133 can include any type
of non-volatile storage devices for storing code and/or
data for nodes 102-105. This includes, but is not limited
to, magnetic storage devices (such as disk drives or
tape drives), optical storage devices and magneto-opti-
cal storage devices. This also includes non-volatile
semiconductor storage devices, such as flash memo-
ries or battery-backed random access memories. In
one embodiment of the present invention, storage
devices 130-133 include disk drives.
[0033] If no failure occurs, intelligent storage system 
120 operates generally as follows. A node, such as node
102, makes an access request, such as a read opera-
tion, to intelligent storage system 120. Controller 122
within intelligent storage system 120 receives the ac-
cess request and tries to satisfy the access request with-
in local cache 124. If necessary, controller 122 sends a
request to a storage device, such as storage device 130,
to complete the request. If the access is a read opera-
tion, storage device 130 returns the requested data, and
controller 122 forwards the requested data back to node
102.
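
By way of illustration only, the read path just described might be sketched as follows. The function `read_block` and the dictionary stand-ins for cache 124 and storage device 130 are assumptions made for this example. Keeping a copy of the block in the cache after a miss is likewise an assumption; the description states only that the controller first tries to satisfy the request within the cache.

```python
def read_block(block_id, cache, storage_device):
    """Serve a read from the cache if possible, otherwise from the device.

    Returns the data together with a label naming where it came from.
    """
    if block_id in cache:
        return cache[block_id], "cache"
    data = storage_device[block_id]  # request sent on to the storage device
    cache[block_id] = data           # assumption: retain a copy for later reads
    return data, "device"


storage_device_130 = {7: b"payload"}  # hypothetical contents of device 130
cache_124 = {}                        # cache 124 starts out empty

first = read_block(7, cache_124, storage_device_130)
second = read_block(7, cache_124, storage_device_130)
print(first)   # (b'payload', 'device')
print(second)  # (b'payload', 'cache')
```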



Partition 

[0034] FIG. 2A illustrates an example of a partition be- 
tween nodes 102-105 in clustered computing system 
100. As mentioned above, partition 200 separates 
nodes 102-103 from nodes 104-105. Nodes 102-103 
detect the partition 200 when a periodic heartbeat/mon- 
itoring process fails to communicate with nodes 
104-105. Nodes 102-103 then communicate with each 
other in an attempt to establish a quorum. Similarly, 
nodes 104-105 communicate with each other in an at- 
tempt to establish a quorum. Since neither nodes
102-103 nor nodes 104-105 have a majority of the votes
in the cluster, neither nodes 102-103 nor nodes 104-105
can establish a quorum.

[0035] FIG. 2B illustrates an example of a partition be-
tween nodes in clustered computing system 100 in ac-
cordance with an embodiment of the present invention.
In this case, controllers 122-123 from intelligent storage
system 120 from FIG. 1 are also members of the cluster,
and can participate in heartbeat/monitoring operations
as well as in the process of establishing a quorum. After
the partition 200 takes place, nodes 102-103 attempt to
establish a quorum, but are unsuccessful because they
only have two out of the total of six votes in the cluster.
Nodes 104-105 and controllers 122-123 also communi-
cate with each other in an attempt to establish a quorum.
In this case, they successfully establish quorum 202 be-
cause they possess four of the six votes in the cluster.
In this way, controllers 122-123 within intelligent storage
system 120 can aid in the process of detecting a failure
and in reforming the cluster.
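
By way of illustration only, the outcome of the partition in FIG. 2B can be checked with a short sketch. The function `group_with_quorum` and the member labels such as `node102` and `ctrl122` are assumptions introduced for this example; each member is given one vote, which matches the total of six votes stated above.

```python
def group_with_quorum(groups, votes):
    """Return the first group holding more than half of all votes, else None."""
    total = sum(votes.values())
    for group in groups:
        if 2 * sum(votes[member] for member in group) > total:
            return group
    return None


# Six members with one vote each: four nodes plus two storage controllers.
votes = {"node102": 1, "node103": 1, "node104": 1, "node105": 1,
         "ctrl122": 1, "ctrl123": 1}
side_a = frozenset({"node102", "node103"})
side_b = frozenset({"node104", "node105", "ctrl122", "ctrl123"})

winner = group_with_quorum([side_a, side_b], votes)
print(sorted(winner))  # ['ctrl122', 'ctrl123', 'node104', 'node105']

# Without the controllers as members, the same partition has no winner.
node_votes = {m: v for m, v in votes.items() if m.startswith("node")}
nodes_only_b = frozenset({"node104", "node105"})
print(group_with_quorum([side_a, nodes_only_b], node_votes))  # None
```

Because a quorum requires strictly more than half of the votes, at most one group can satisfy the test, so the order in which the groups are examined does not affect the result.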

System Layers 

[0036] FIG. 3 illustrates system layers involved in per-
forming a file system access wherein block-level com- 
mands are transferred to intelligent storage system 120 
in accordance with an embodiment of the present inven- 
tion. These layers are used during one mode of opera- 
tion for intelligent storage system 120. During another 
mode of operation, intelligent storage system 120 can 
receive file system commands. 
[0037] A number of layers are illustrated within node 
102. Application 302 first initiates a file system access 
by executing an instruction that causes a file system call. 
This file system access may include a read operation, 
a write operation, or any other type of file access or file 
maintenance operation. The file system access passes 
into file system 304, which converts the file system ac- 
cess into lower-level commands to access logical blocks 
of storage. Logical volume manager 306 receives these 
lower-level commands and converts them into even low- 
er-level block access commands. Note that logical vol- 
ume manager 306 may additionally perform mirroring 
for fault-tolerance purposes. Logical volume manager 
306 passes the block-level commands to SCSI driver 
308. SCSI driver 308 converts the block-level com-
mands into a form that adheres to the SCSI protocol.
SCSI driver 308 passes the commands through trans-
port interface 310, which converts the SCSI protocol
command into a form that is suitable for transmission
over a transport link.

EP 1 107 119 A2

[0038] Within intelligent storage system 120, the 
block-level command is received from node 102 across 
transport link 311 and is converted back into a SCSI pro-
tocol command at transport interface 312, and is then
passed into SCSI emulator 314. SCSI emulator 314 pro-
vides an interface that appears to be a dumb SCSI de-
vice, such as a disk drive. Within SCSI emulator 314,
the block-level command is converted into a format that
allows the block to be looked up within cache 124. If the
access request cannot be serviced entirely within cache 
124, the access request passes through logical volume 
manager 317 to controller 318. Controller 318 passes 
the access request through SCSI device driver 320 and 
through SCSI to fiber channel converter 322 before for- 
warding the request to storage device 130. 
[0039] Note that a block-level access can be serviced
from cache 124 (without accessing storage device 130) 
if the block-level access is a read operation, and the re- 
quested block is present in cache 124. During a write 
operation, a block that is written into cache 124 will 
eventually be written back to storage device 130. 
[0040] Note that transport interface 312 within intelli- 
gent storage system 120 is additionally coupled to clus- 
ter forming mechanism 321 and failure detection mech- 
anism 323. Failure detection mechanism 323 performs 
heartbeat and monitoring functions that determine 
whether or not communication with another node in the 
cluster has failed. Cluster forming mechanism 321 
seeks to form a quorum between members of the cluster 
after the failure has been detected so that membership 
of the cluster can be reformed. 
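
By way of illustration only, the heartbeat and monitoring functions attributed to failure detection mechanism 323 might be sketched as follows. The function names, the timestamp table and the `timeout` parameter are assumptions made for this example; the description does not specify how a missed heartbeat is recognized or how views of the cluster state are compared.

```python
def silent_members(last_heard, now, timeout):
    """Return members whose latest heartbeat is older than `timeout` seconds."""
    return sorted(m for m, seen in last_heard.items() if now - seen > timeout)


def views_consistent(reported_views):
    """Return True when every reporting member lists the same membership."""
    return len({frozenset(view) for view in reported_views.values()}) <= 1


# Hypothetical heartbeat arrival times recorded by one controller.
last_heard = {"node102": 9.6, "node103": 9.9, "node104": 4.0, "node105": 3.5}
print(silent_members(last_heard, now=10.0, timeout=2.0))
# ['node104', 'node105']

# Membership as reported by two members in their heartbeat messages.
same = {"node102": {"node102", "node103"}, "node103": {"node102", "node103"}}
differs = {"node102": {"node102", "node103"}, "node103": {"node103"}}
print(views_consistent(same))     # True
print(views_consistent(differs))  # False
```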
[0041] FIG. 4 illustrates system layers involved in per-
forming a file system access wherein file-level com- 
mands are transferred to a storage system in accord- 
ance with an embodiment of the present invention. In 
this embodiment, application 402 generates a file sys- 
tem access and this access is passed into file system 
402 (which is a client-side portion of a distributed file 
system). This file system access is immediately passed 
into transport interface 404, which packages the file sys-
tem access for transport across a communication link 
from node 102 to intelligent storage system 120. 
[0042] Within intelligent storage system 120, the file 
system access passes through transport interface 406, 
which unpackages the file system access and then 
passes it into file system 408 (which is a server-side por- 
tion of a distributed file system). 
[0043] In one embodiment of the present invention, 
file system 402 (on the client side) and file system 408 
(on the server side) act in concert to provide high avail- 
ability. For example, suppose node 102 fails during a file
system operation. The highly available system allows a 
secondary backup node, such as node 103, to continue
operating in place of node 102. Note that the present
invention also applies to computer systems that do not 
provide high availability. 

[0044] File system 408 within intelligent storage sys-
tem 120 passes the file system access to underlying file
system 410. (Note that in general any file system can
be used to implement underlying file system 410.) Un-
derlying file system 410 attempts to satisfy the file sys-
tem request from cache 124. If a further access is re-
quired to storage device 130, the file system access is
converted into a block-level request. This block-level re-
quest passes through logical volume manager 317 to
SCSI device driver 320. Next, the block-level request
passes through SCSI device driver 320 and is converted
into a format suitable for transmission over a communi-
cation channel adhering to the fiber channel standard in
block 322. This block-level request is then forwarded to
storage device 130.

[0045] Note that the present invention allows for two 
modes of operation. A first mode of operation (illustrated
in FIG. 4) allows intelligent storage system 120 to accept
higher-level file access commands. A second mode of
operation (illustrated in FIG. 3) allows intelligent storage
system 120 to accept lower-level block access com-
mands.

[0046] The first mode of operation eliminates the work 
involved in converting an access request into a block- 
level form within node 102, and then emulating a simple
SCSI device with intelligent storage system 120, which
converts the access into a higher-level form that is sub-
sequently converted back down into a lower-level form
before passing to storage device 130.
[0047] Instead, the first mode of operation sends the
higher-level file system access directly to intelligent
storage system 120 without first converting it into a
block-level form. 

[0048] Also note that as in FIG. 3, transport interface 
312 within intelligent storage system 120 is additionally 
coupled to cluster forming mechanism 321 and failure 
detection mechanism 323.

Process of Detecting a Communication Failure and 
Reconfiguring a Cluster 

[0049] FIG. 5 is a flow chart illustrating the process of
detecting a communication failure and reconfiguring a
cluster in accordance with an embodiment of the
present invention. The system starts by sending heart-
beat information from a node to other nodes in the clus-
ter (step 502). Note that controllers 122 and 123 from
intelligent storage system 120 participate in this process
of sending heartbeat information. The node also moni-
tors heartbeat information received from other nodes in
clustered computing system 100 (step 504). From this
heartbeat information, the node determines whether
there has been a failure in communicating with anoth-
er node in clustered computing system 100, or whether nodes
in clustered computing system 100 do not have a con-
sistent view of the state of the cluster.
[0050] The node may also be able to determine that 
there has been a failure indirectly, by receiving a com- 
munication from another node in the cluster indicating 
that there has been a failure. If no failure is detected at 
step 506, the node returns to step 502 to send and mon- 
itor heartbeat information again. If a failure is detected, 
the system attempts to establish an agreement between 
a quorum of the members of the cluster (step 508). This 
involves sending communications from the node to oth- 
er nodes in the cluster in an attempt to reach an agree- 
ment between enough nodes to establish a quorum. 
Note that controllers 122-123 from intelligent storage
system 120 have the same capabilities as nodes
102-105 in the process of reforming the cluster, and can
initiate communications with other nodes in the cluster
to establish a quorum. Also note that cluster members
can have different numbers of votes in the quorum de-
termination. For example, an important node, such as
a file server, may have more votes than another node. In
a second example, each controller 122-123 is given a
vote for each storage device under its direct control.
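
By way of illustration only, the two vote-assignment examples above might be combined as follows. The function `assign_votes`, the split of storage devices 130-133 between the two controllers and the choice of node 102 as the file server are all assumptions made for this example and are not taken from the disclosure.

```python
def assign_votes(nodes, devices_by_controller, overrides=None):
    """Build a vote table for the quorum determination.

    Each node starts with one vote, each controller receives one vote per
    storage device under its direct control, and `overrides` lets an
    important member (for example a file server) carry extra votes.
    """
    votes = {node: 1 for node in nodes}
    for controller, devices in devices_by_controller.items():
        votes[controller] = len(devices)
    votes.update(overrides or {})
    return votes


votes = assign_votes(
    nodes=[102, 103, 104, 105],
    devices_by_controller={122: [130, 131], 123: [132, 133]},  # assumed split
    overrides={102: 2},  # assumption: node 102 hosts the file server
)
print(votes)                # {102: 2, 103: 1, 104: 1, 105: 1, 122: 2, 123: 2}
print(sum(votes.values()))  # 9
```

With nine votes in total under these assumptions, a group would need at least five votes to establish a quorum.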
[0051] The system next determines if the agreement 
has been reached (step 510). If agreement cannot be 
reached, the node can signal an error condition (step 
516). Alternatively, the node can simply continue to op- 
erate without having access to the shared resources of 
the cluster. Note that these shared resources can in- 
clude, but are not limited to, storage devices, memories, 
communication linkages, and any other shared resourc- 
es of clustered computing system 100. 
[0052] If agreement can be reached, the system 
forms a new cluster between the nodes that agree (step 
512), and the other nodes are fenced out of the cluster
(step 514) so that they can no longer have access to the
shared resources of the cluster and can no longer par-
ticipate in performing tasks assigned to the cluster.
The nodes within the new cluster then continue to oper-
ate as they did prior to the failure.
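
By way of illustration only, steps 510 through 516 of FIG. 5 might be sketched as follows. The names `reform_cluster` and `NoQuorumError` are assumptions introduced for this example, and the message exchange by which members actually reach agreement (step 508) is not modeled; the set of agreeing members is simply supplied as an input.

```python
class NoQuorumError(Exception):
    """Signals the error condition of step 516 (hypothetical name)."""


def reform_cluster(members, agreeing, votes):
    """Form a new cluster if the agreeing members hold a quorum.

    members  -- every member of the old cluster
    agreeing -- members that reached agreement (how they agree is not modeled)
    votes    -- mapping from member to vote count
    Returns (new_cluster, fenced_out) as two sets.
    """
    total = sum(votes[m] for m in members)
    held = sum(votes[m] for m in agreeing)
    if 2 * held <= total:                    # step 510: agreement not reached
        raise NoQuorumError(f"only {held} of {total} votes agree")
    new_cluster = set(agreeing)              # step 512: form the new cluster
    fenced_out = set(members) - new_cluster  # step 514: fence out the rest
    return new_cluster, fenced_out


votes = {102: 1, 103: 1, 104: 1, 105: 1, 122: 1, 123: 1}
members = set(votes)

new_cluster, fenced_out = reform_cluster(members, {104, 105, 122, 123}, votes)
print(sorted(new_cluster))  # [104, 105, 122, 123]
print(sorted(fenced_out))   # [102, 103]

try:
    reform_cluster(members, {102, 103}, votes)
except NoQuorumError as err:
    print("no quorum:", err)  # no quorum: only 2 of 6 votes agree
```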
[0053] The foregoing descriptions of embodiments of 
the invention have been presented for purposes of illus- 
tration and description only. They are not intended to be 
exhaustive or to limit the invention to the forms dis- 
closed. Accordingly, many modifications and variations 
will be apparent to practitioners skilled in the art. Addi- 
tionally, the above disclosure is not intended to limit the 
invention. The scope of the invention is defined by the 
appended claims. 



Claims 

1. A method for establishing an agreement between
members (102, 103, 104, 105) of a cluster of nodes
in a clustered computing system (100) for purposes
of forming a new cluster if there is a failure to com-
municate with a member of the cluster, the clustered
computing system including an intelligent storage
device controller (122, 123) that acts as a cluster
member during a process of reforming the cluster,
the method comprising:

detecting a failure to communicate with the
member of the cluster; and
in response to detecting the failure, attempting
(508) to form a new cluster by attempting to es-
tablish an agreement between a group of mem-
bers of the cluster, the group including at least
a quorum of the members of the cluster,
wherein attempting to establish the agreement
includes initiating communications from the in-
telligent storage device controller (122, 123)
with other nodes in the cluster (102, 103, 104,
105), and

if the agreement is established, forming the new
cluster (202) from the group of members of the clus-
ter that have reached agreement.

2. The method of claim 1, wherein the quorum con-
tains a subset of the members of the cluster with
more than one half of a number of votes that are
distributed between the members of the cluster.

3. The method of claim 1 or claim 2, wherein detecting
the failure includes periodically communicating
(502, 504) with other members (102, 103, 104, 105)
of the cluster in order to determine whether commu-
nications with the other members of the cluster have
failed.

4. The method of claim 3, wherein communicating with
the other members (102, 103, 104, 105) of the clus-
ter includes communicating information that ena-
bles members of the cluster to verify that they have
a consistent view of a state of the cluster.

5. The method of claim 1 or claim 2, wherein detecting
the failure includes periodically receiving informa-
tion at the intelligent storage device controller (122,
123) from other members of the cluster, the infor-
mation enabling the intelligent storage device con-
troller to determine whether communications with
the other members of the cluster have failed.

6. The method of claim 1 or claim 2, wherein detecting
the failure includes being informed of the failure by
another member of the cluster.

7. A method according to any preceding claim, where-
in forming the new cluster further comprises exclud-
ing nodes in the clustered computing system that
are not members of the new cluster from accessing
shared resources of the clustered computing sys-
tem.

8. A method according to any preceding claim, where-
in different members of the cluster can have differ- 
ent numbers of votes for purposes of establishing 
the quorum. 

9. A method according to any preceding claim, where- 
in the intelligent storage device controller has one 
vote for each storage device under its control for 
purposes of establishing the quorum. 

10. A method according to any preceding claim, where-
in the intelligent storage device controller can proc-
ess both file-level and block-level accesses to a
storage device.

11. A method according to any preceding claim, where-
in cluster membership gives a node access to
shared resources of the clustered computing sys-
tem.

12. A method according to any preceding claim, where-
in the clustered computing system includes more
than one intelligent storage device controller that
acts as a cluster member.

13. A method according to any preceding claim, further 
comprising controlling at least one storage device 
using the intelligent storage device controller. 

14. An apparatus that establishes an agreement be-
tween members (102, 103, 104, 105) of a cluster of
nodes in a clustered computing system (100) for
purposes of forming a new cluster (202) if there is
a failure to communicate with a member of the clus-
ter, the apparatus comprising:

an intelligent storage device controller (122,
123) that controls at least one storage device
(130, 131, 132, 133) for the clustered comput-
ing system;

a detection mechanism (323), within the intelli-
gent storage device controller, that is config-
ured to detect a failure to communicate with the
member of the cluster; and
a cluster forming mechanism (321), within the
intelligent storage device controller;

wherein if the failure is detected, the cluster forming
mechanism (321) is configured to attempt to estab-
lish the agreement between a group of members of
the cluster, the group including at least a quorum of
the members of the cluster, and to
form the new cluster from the group of mem-
bers of the cluster that have reached agreement if
the agreement is established.

15. The apparatus of claim 14, wherein the quorum
(202) contains a subset of the members (102, 103,
104, 105, 122, 123) of the cluster with more than
one half of a number of votes that are distributed
between the members of the cluster.
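
The quorum condition of claim 15 can be sketched as a strict-majority test over the distributed votes. This is a minimal illustration only: the function name `has_quorum` and the vote distribution are assumptions, and the reference numerals are reused merely as member names.

```python
def has_quorum(group, votes):
    """Return True if `group` holds more than one half of all distributed votes."""
    total = sum(votes.values())
    held = sum(votes[name] for name in group)
    # Integer comparison avoids rounding; exactly one half is not a quorum.
    return 2 * held > total


# Assumed vote distribution (not given in the claims): one vote per node,
# two votes per intelligent storage device controller.
votes = {
    "node_102": 1, "node_103": 1, "node_104": 1, "node_105": 1,
    "controller_122": 2, "controller_123": 2,
}

majority_side = {"node_104", "node_105", "controller_122", "controller_123"}
tied_side = {"node_102", "node_103", "controller_122"}

majority_ok = has_quorum(majority_side, votes)  # 6 of 8 votes
tied_ok = has_quorum(tied_side, votes)          # 4 of 8 votes
```

Because the test requires strictly more than one half of the votes, two disjoint groups can never both satisfy it, so at most one new cluster can form after a communication failure.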

16. The apparatus of claim 14 or claim 15, wherein the
detection mechanism (323) is configured to periodically
communicate with other members of the cluster
in order to determine whether communications
with the other members of the cluster have failed.

17. The apparatus of claim 16, wherein the detection
mechanism (323) is configured to communicate information
that enables members of the cluster to
verify that they have a consistent view of a state of
the cluster.

18. The apparatus of claim 14 or claim 15, wherein the
detection mechanism (323) is configured to periodically
receive information from other members of the
cluster, the information enabling the intelligent storage
device controller to determine whether communications
with the other members of the cluster have
failed.
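
One possible way to realise the periodic check of claims 16 and 18 is a timeout on the most recent message received from each member. The claims do not say how a failure is inferred, so the timeout rule, the function name `failed_members`, and the timestamps below are all assumptions made for illustration.

```python
def failed_members(last_heard, now, timeout):
    """Return the sorted names of members silent for longer than `timeout` seconds.

    `last_heard` maps each member name to the time (in seconds) at which
    information was last received from that member.
    """
    return sorted(name for name, t in last_heard.items() if now - t > timeout)


# Assumed observations as seen by the intelligent storage device controller.
last_heard = {"node_102": 100.0, "node_103": 91.0, "node_104": 99.5}

# node_103 has been silent for 10 s, which exceeds the assumed 5 s timeout.
suspects = failed_members(last_heard, now=101.0, timeout=5.0)
```

A member flagged this way would only be a candidate for exclusion; under claim 14 the new cluster is formed only if the remaining group reaches agreement and includes a quorum.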

19. An apparatus according to any one of claims 14 to
18, wherein the cluster forming mechanism (321) is
configured to exclude nodes (102, 103) in the clustered
computing system (100) that are not members
of the new cluster (202) from accessing shared resources
of the clustered computing system.
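
The exclusion of claim 19 can be pictured as an access list that is replaced when the new cluster forms. The sketch below is hypothetical: the class `SharedResource`, its methods, and the membership shown are illustrative assumptions and do not describe the mechanism of the specification.

```python
class SharedResource:
    """A shared resource that only serves members of the current cluster."""

    def __init__(self, members):
        self._allowed = set(members)

    def reconfigure(self, new_cluster):
        """Restrict access to the members of the newly formed cluster."""
        self._allowed = set(new_cluster)

    def access(self, node):
        """Grant access to a member, or raise PermissionError for a non-member."""
        if node not in self._allowed:
            raise PermissionError(f"{node} is not a member of the current cluster")
        return f"{node} granted access"


# Assumed scenario: nodes 102 and 103 are left out of the new cluster.
resource = SharedResource(["node_102", "node_103", "node_104", "node_105"])
resource.reconfigure(["node_104", "node_105"])
```

After `reconfigure`, a request from `node_104` succeeds, while a request from `node_102` raises `PermissionError`.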

20. An apparatus according to any one of claims 14 to
19, wherein different members of the cluster can
have different numbers of votes for purposes of establishing
the quorum.

21. An apparatus according to any one of claims 14 to
20, wherein the intelligent storage device controller
(122, 123) has one vote for each storage device
(130, 131, 132, 133) under its control for purposes
of establishing the quorum (202).

22. An apparatus according to any one of claims 14 to
21, wherein the intelligent storage device controller
(122, 123) is configured to process both file-level
and block-level accesses to a storage device (130,
131, 132, 133).

23. An apparatus according to any one of claims 14 to
22, wherein cluster membership gives a node access
to shared resources of the clustered computing
system (100).

24. An apparatus according to any one of claims 14 to
23, wherein the clustered computing system includes
more than one intelligent storage device
controller (122, 123) that acts as a cluster member.

25. A computer readable storage medium storing instructions
that when executed by a computer cause
the computer to perform a method for establishing
an agreement between members of a cluster of
nodes in a clustered computing system for purposes
of forming a new cluster if there is a failure to
communicate with a member of the cluster, the clustered
computing system including an intelligent
storage device controller that acts as a cluster
member during a process of reforming the cluster,
according to any one of claims 1 to 13.

EP 246218B EQUIVALENT-ABSTRACTS:

as well as of several logical decision units (VOTERS) which derive the correctness
of an information message from the predetermined majority of information messages,
said majority of information messages relating to the total number of computer
systems, characterised in that an input/output voter (IOV) is assigned to each
computer system (SRU) which votes on the input- or output messages, respectively,
of the input- (IC) or output channels (OC), respectively, of all computer systems
(SRU) of a same voting node (TMR), whereby each voting node consists of several
computer systems (SRUs), each input/output voter (IOV) passes on the contents of the
information message (IM) it receives via the input channels (IC1...ICn) or the
output channels (OC1, ...OCm), respectively, to all other input/output voters in
the same voting node (TMR) by issuing a voting message (VM) by means of voting
links (VL) and then each input/output voter (VM) and all voting messages (VM) it
receives from the other input/output voters of the same voting node, to each
computer system (SRU) a sequence voter (SV) is assigned to which the message
recognized to be the correct one to decide on the correct sequence of the
information messages to be processed during the input/output voting is supplied,
each sequence voter (SV) transmits the contents of the information message it
receives to all other sequence voters (SV) in the same voting node (TMR) by issuing
a sequence voting message by means of voting links, then performs voting on all
received sequence voting messages including its own message and transmits the
uniform message information sequence recognised to be the correct one of its own
application process (AP).